General instructions for all assignments:
R Markdown file (named as: [AndrewID]-Lab08.Rmd – e.g. “sventura-Lab08.Rmd”) to the Lab 08 submission section on Blackboard. You do not need to upload the .html file.This week’s oral evaluation graphic: Comparing soccer players (from StatsBomb / Ted Knutson / [@mixedknuts](https://twitter.com/mixedknuts))
Sam Says: It’s important to avoid inserting your own commentary when discussing graphics Your job, as a statistician / data scientist / quantitative analyst, is to interpret the graph for the viewer. Discuss the facts presented via the data/graphic; avoid overstepping your boundaries by inserting your own opinions into what should be a presentation of facts.
Reminder, the following strategy is ideal when presenting graphs orally:
First, explain what is being shown in the graph. What is being plotted on each axis? What do the colors correspond to? What are the units (if applicable)? What are the ranges of different variables (if applicable)? Where does the data come from (if applicable)?
Next, explain the main takeaway of the graph. What do you want the viewer to understand after having seen this graph?
If applicable, explain any secondary takeaways or other interesting findings.
Finally, for this class, but not necessarily in general: Critique the graph. What do you like/dislike? What would you keep/change? Etc.
Load the Students’ Academic Performance dataset into R from the following location on the course GitHub page, and store it in a data.frame called students: https://raw.githubusercontent.com/sventura/315-code-and-datasets/master/data/students.csv
Brief data description:
For more information on the data, see here.
(5 points)
Use your theme. Do not directly copy the instructors’ theme.
For all graphs, do not use the default color scheme.
Set the options in the header of your .Rmd file so that all code is hidden. To do this, change code_folding: show to code_folding: hide at the top of your file.
(5 points each)
Distance Matrices
A distance matrix is a data structure that specifies the “distance” between each pair of observations in the original \(n\)-row, \(p\)-column dataset. For each pair of observations (e.g. \(x_i, x_j\)) in the original dataset, we compute the distance between those observations, denoted as \(d(x_i, x_j)\) or \(d_{ij}\) for short.
A variety of approaches for calculating the distance between a pair of observations can be used. The most commonly used approach (when we have continuous variables) is called “Euclidean Distance”. The Euclidean distance between observations \(x_i\) and \(x_j\) is defined as follows: \(d(x_i, x_j) = \sum_{l = 1}^p (x_{i,l} - x_{j,l}) ^ 2\). That is, it is the sum of squared differences between each column (\(l \in \{1, ..., p\}\)) of \(x_i\) and \(x_j\) (remember, there are \(p\) original columns / variables).
Note that if some variables in our dataset have substantially higher variance than others, the high-variance variables will dominate the calculation of distance, skewing our resulting distances towards the differences in these variables. As such, it’s common to scale the original continuous dataset before calculating the distance, so that each variable is on the same scale.
Load the Student Performance dataset into R, and store it in a data frame called student. Create a subset of the student dataset (called student_cont) that contains only the continuous variables (there should be 4 continuous variables – also, do not include Grade and AbsentDays in this).
Let’s examine the columns of student_cont to determine if the variables in our dataset have (roughly) equal variance (or standard deviation). To do this, type apply(student_cont, 2, sd) to calculate the standard deviation (sd) of the data stored in each column of student_cont. Comment on the resulting standard deviations of each column. Are any of them substantially larger or smaller than the others? Do you recommend scaling the variables before calculating the distances?
Look up the help documentation for the scale() function. What does it do? Use the scale() function to create a scaled version of student_cont. Store this in a data frame called student_cont_scale. Calculate the standard deviation of each column of student_cont_scale, similar to how you did this in part (b). What changed after scaling the variables?
Look up the help documentation for the dist() function. What does this function do? What is the default method for calculating the distance between each observation?
Create a distance matrix for the student_cont_scale dataset using the Euclidean distance method. Store this in an object called dist_student. Note that since we are comparing all \(n\) pairs of observations in the original dataset, there are \(n \choose 2\) comparisons. Verify that this is what happened: Type length(dist_student) and choose(nrow(student_cont_scale), 2). Are they the same? Does this make sense? Why or why not?
(BONUS: 1 point): One of the methods for calculating distance between observations in the dist() function is called Manhattan. How does this method work? Why is it called Manhattan? (You may want to do an internet search to help answer this.)
(10 points each)
Multi-Dimensional Scaling
As discussed in class last week, multi-dimensional scaling (MDS) tries to find the “best” \(k\)-dimensional projection of the original \(p\)-dimensional dataset (\(k < p\)).
By “best”, we mean that our resulting projection (which will have \(n\) rows and \(k\) columns/variables) will maintain the important features/structure of the original \(p\)-dimensional dataset (e.g. group structure).
As such, MDS tries to preserve the order of the pairwise distances stored in our distance matrix.
cmdscale() function.Use MDS to create a 2-dimensional projection of the student_cont_scale dataset. To do this, use cmdscale(dist_student, k = 2). Store this in a variable called student_mds. What is the dimensionality (use the dim() function) of the resulting dataset? Compare this to the dimensionality of student_cont_scale.
Wow! All of those steps from the last two problems, and we finally have a low-dimensional projection of our original dataset of continuous variables that we can work with. Let’s visualize it.
student_mds <- as.data.frame(student_mds).colnames(student_mds) <- c("mds_coordinate_1", "mds_coordinate_2"). Our resulting data.frame has 2 columns.Grade and AbsentDays to the data.frame using lines of code like student_mds <- mutate(student_mds, Grade = student$Grade, AbsentDays = student$AbsentDays), so that it now has four columns.student_mds data frame now?Create a scatterplot with the first MDS “coordinate” on the x-axis and the second MDS “coordinate” on the y-axis. Describe the resulting graph. Does there appear to be any group structure? If so, how many groups do you think there are? To which areas of the graph do the groups correspond? (There is not necessarily a right answer here – just try your best to answer given the visualization you have.)
Repeat part (d), but this time:
Grade.AbsentDays variable.scale_shape_manual() function to change the point types so that they correspond to the number of absent days (A for Above-7 and U for Under-7), i.e. + scale_shape_manual(values = c("Above-7" = "A", "Under-7" = "U")).size of the points as necessary in order to make the graph easier to read.Grade and/or AbsentDays?Grade categories are closely clustered together? Which are not as tightly clustered? (Hint: which colors are most closely clustered together?)AbsentDays categories are closely clustered together? Which are not as tightly clustered? (Hint: which text categories are most closely clustered together?)(10 points)
Dendrograms and Enhancing the User Experience
Code is provided for you for all of the following problems. All you have to do is edit your R Markdown document so that the resulting graphs are displayed in tabs, as specified below.
The ability to create statistical reports that are clear, concise, and easy for the user to view and understand is essential for modern statisticians / data scientists. There are a few additional tricks in R Markdown and R Studio that you can use to enhance your user’s experience when viewing your reports and graphics.
Remember, R Markdown output is typically stored as an HTML file (though other formats are also possible). As such, it’s common to post the resulting HTML file to the web for users to view and interact with. Read about the recent updates to R Markdown here. In that article, you’ll learn about:
Your assignment for this problem is to put the HTML output in tabs for parts (a) through (d) below. To do this, you may first need to install the dendextend package, which is used to visualize dendrograms – a specific type of plot associated with hierarchical clustering and visualizing high-dimensional structure. We’ll learn more about dendrograms next week.
library(tidyverse)
library(dendextend)
library(MASS)
data(Cars93)
colorblind_palette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
cars_cont <- dplyr::select(Cars93, Price, MPG.city, MPG.highway, EngineSize,
Horsepower, RPM, Fuel.tank.capacity, Passengers,
Length, Wheelbase, Width, Turn.circle, Weight)
dend <- cars_cont %>% scale %>% dist %>% hclust %>% as.dendrogram
get_colors <- function(x, palette = colorblind_palette) palette[match(x, unique(x))]
Create a dendrogram of some of the continuous variables in the Cars93 dataset.
Recreate the graph from (a), but this time, color the leaves of the dendrogram according to the Type of the car.
Recreate the graph from (b), but this time, change the leaf labels in the dendrogram to say the Type of the car.
Recreate the graph from (c), but this time, change the leaf labels in the dendrogram to say the Origin of the car instead of the Type of the car.
ggplot(dend, horiz = T)
dend %>% set("labels_col", get_colors(Cars93$Type), order_value = TRUE) %>%
ggplot(horiz = T)
dend %>% set("labels", Cars93$Type, order_value = TRUE) %>%
set("labels_col", get_colors(Cars93$Type), order_value = TRUE) %>%
ggplot(horiz = T)
dend %>% set("labels", Cars93$Origin, order_value = TRUE) %>%
set("labels_col", get_colors(Cars93$Type), order_value = TRUE) %>%
ggplot(horiz = T)
Cars93 dataset? That is, do cars with the same / similar Type values appear in the same area of the dendrogram (under the same branches)? Do cars with the same Origin (USA vs. non-USA) appear close to each other in the dendrogram?